## Importing Libraries

## Load the data

From the histogram and bar charts, the majority of stores record sales between 10 and 20 million, with above-average stores recording between 20 and 30 million. High-sales stores exceed 40 million, while low-sales stores record less than 10 million.

The Sankey diagram does not show significant variation in the products sold across clusters, with the greatest sales going to food and beverages. A distant second is personal care and household goods. Perhaps a thorough examination of the individual product families, instead of the broad categories, would have revealed more interesting detail about the products sold by different clusters.

The majority of sales (over 580 million) are in Pichincha state, where store clusters A and D dominate. Pastaza state has the lowest sales, at 5 million, and is dominated by store cluster C.

The line chart does not appear to show any significant correlation between the number of promotions and sales for Automotive, Clothing, and Electronics. However, this lack of direction may be a result of the low sales in these categories. There appears to be a strong positive relationship in the Food category, so promotions are likely to increase sales there. The most interesting relationship is the one between promotions and sales of personal care, home appliance, and specialty products: despite a significant increase in promotional activity, sales appear to stay constant. These relationships could be examined further with a closer look at the data.

Observed: The first plot captures the overall pattern, including trend, seasonality, and noise. The fluctuations and spikes in sales are visible, with some noticeable peaks around 2016.

Trend: The second plot reveals the general direction of sales over time, smoothing out short-term fluctuations. Sales generally increased from 2013 to 2016, with some notable dips and peaks, followed by a slight decline towards the end of the period.

Seasonal: The third plot displays the seasonal component, capturing regular patterns that repeat yearly. It shows consistent periodic fluctuations, indicating regular seasonal variation in sales. The amplitude and frequency of these seasonal changes appear consistent throughout the period.

Residual: The bottom plot represents the residual component, the irregular or random fluctuations that remain after removing the trend and seasonal components. It shows the noise in the sales data that is not explained by trend or seasonality. There are some noticeable spikes, which could be explained by holidays and other seasonal events.
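To make the four panels concrete, here is a minimal pure-Python sketch of a classical additive decomposition. It is an illustration only, not the code used in this notebook (libraries such as statsmodels offer `seasonal_decompose` for real use). The function name `decompose_additive`, the odd-period restriction, and the toy series are all assumptions made for the example.

```python
from statistics import mean


def decompose_additive(series, period):
    """Classical additive decomposition: series = trend + seasonal + residual.

    Sketch only: assumes an odd `period` so the centred moving average
    is symmetric around each point.
    """
    if period < 3 or period % 2 == 0:
        raise ValueError("this sketch only handles odd periods >= 3")
    n, half = len(series), period // 2
    if n < 2 * period:
        raise ValueError("need at least two full periods of data")

    # Trend: centred moving average; undefined (None) at both edges.
    trend = [None] * n
    for t in range(half, n - half):
        trend[t] = mean(series[t - half : t + half + 1])

    # Seasonal: average the detrended values at each position in the cycle,
    # then centre the pattern so it averages to zero over one period.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    pattern = [mean(b) for b in buckets]
    offset = mean(pattern)
    seasonal = [pattern[t % period] - offset for t in range(n)]

    # Residual: whatever trend and seasonality leave unexplained.
    residual = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, residual


# Toy data (made up): a linear trend plus a repeating 7-step pattern.
toy_pattern = [3, -1, -2, 0, 1, -3, 2]
toy_series = [100 + 2 * t + toy_pattern[t % 7] for t in range(28)]
toy_trend, toy_seasonal, toy_residual = decompose_additive(toy_series, 7)
```

On this noise-free toy series the residual is zero wherever the trend is defined; on real sales data, the residual is where holiday spikes would show up.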

This graph supports the initial assumption that fluctuations in sales may be influenced by national holidays, with most of the peaks occurring before or during holidays and the dips during or after them.

The plot of the moving average and rolling standard deviation confirms the trends seen in the time series decomposition.
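For reference, the two rolling statistics can be sketched in a few lines of standard-library Python. The function name `rolling_stats` and the trailing-window convention are assumptions for illustration; the window length used for the plot is not stated here.

```python
from statistics import mean, stdev


def rolling_stats(series, window):
    """Return (rolling_mean, rolling_std) computed over a trailing window.

    Positions with fewer than `window` observations are None.
    """
    if window < 2:
        raise ValueError("window must be at least 2 to compute a standard deviation")
    means, stds = [], []
    for t in range(len(series)):
        if t + 1 < window:
            means.append(None)
            stds.append(None)
        else:
            chunk = series[t + 1 - window : t + 1]
            means.append(mean(chunk))
            stds.append(stdev(chunk))  # sample standard deviation (n - 1)
    return means, stds
```

A roughly flat rolling mean and rolling standard deviation would suggest a stationary series; a drifting mean points to a trend, which matches the reading of the decomposition.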

Significant spikes at certain lags indicate that those lags contribute additional information to the time series, beyond what is explained by the previous lags. The blue shaded area represents the confidence interval, and spikes outside this area are statistically significant. The ACF plot shows strong correlations at weekly intervals (lags 7, 14, 21, 28), indicating a weekly seasonal pattern in the sales data. The PACF plot suggests that the immediate past values (lag 1) and weekly patterns (around lag 7) are important for predicting future sales.
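The idea behind the ACF plot can be sketched as follows. This is a simplified illustration: the helper names are hypothetical, and the bound of ±1.96/√n is the usual approximation under a white-noise assumption, which may differ from the band drawn by a plotting library.

```python
import math


def autocorrelation(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mu = sum(series) / n
    dev = [x - mu for x in series]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


def significance_bound(n, z=1.96):
    """Approximate 95% bound for the ACF of white noise of length n."""
    return z / math.sqrt(n)


# Toy data (made up): a 7-step pattern repeated 8 times.
toy = [3, -1, -2, 0, 1, -3, 2] * 8
acf = autocorrelation(toy, 14)
significant_lags = [k for k in range(1, 15) if abs(acf[k]) > significance_bound(len(toy))]
```

In the toy series, lags 7 and 14 stand well outside the bound, which is the same signature of weekly seasonality described above.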

Given the characteristics of this data, we can use SARIMAX to forecast sales. This model allows us to capture both short-term and long-term dependencies within the data while accounting for seasonality. The next decision was whether to use a single, multi-problem, or hybrid approach:

1. Single: forecasting sales for all product families and all stores at the same time (1 model).
2. Multi-problem: forecasting sales for one product family and one store at a time (1780 simpler models).
3. Hybrid: engineering features that combine product families, store clusters, etc., to keep the number of models small while improving accuracy over the single model.

I opted for the multi-problem approach because I did not have enough contextual understanding of the problem.
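For readers unfamiliar with the notation, a SARIMAX$(p,d,q)\times(P,D,Q)_s$ model can be written, in its common regression-with-SARIMA-errors form, as below. The model orders fitted in this notebook are not restated here; a seasonal period of $s = 7$ would match the weekly pattern seen in the ACF, and the exogenous regressors $x_t$ (for example promotion counts or holiday indicators) are an assumption for illustration.

$$
y_t = \beta^\top x_t + u_t
$$

$$
\phi_p(B)\,\Phi_P(B^s)\,(1 - B)^d\,(1 - B^s)^D\,u_t = \theta_q(B)\,\Theta_Q(B^s)\,\varepsilon_t
$$

Here $B$ is the backshift operator ($B\,u_t = u_{t-1}$), $\phi_p$ and $\theta_q$ are the non-seasonal AR and MA polynomials, $\Phi_P$ and $\Theta_Q$ are their seasonal counterparts, and $\varepsilon_t$ is white noise.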

The model appears to fit the data reasonably well, given the significant coefficients for the AR terms and the seasonal MA term. The AIC and BIC values are high, but these criteria are only meaningful when compared across candidate models fitted to the same data, so on their own they do not show whether the model's complexity is justified. The diagnostics indicate some issues: the residuals are not normally distributed and heteroskedasticity is present, which could affect the reliability of the model's predictions, in particular their confidence intervals.
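The normality diagnostic mentioned above is commonly a Jarque-Bera test on the residuals. The sketch below shows how that statistic is built from skewness and kurtosis; it assumes this is the test behind the reported diagnostic, and the function name `jarque_bera` is chosen for the example.

```python
import math


def jarque_bera(residuals):
    """Jarque-Bera statistic and p-value for a list of residuals.

    JB = n/6 * (S^2 + (K - 3)^2 / 4), where S is the sample skewness and
    K the sample kurtosis. Under normality JB is asymptotically
    chi-square with 2 degrees of freedom.
    """
    n = len(residuals)
    if n < 3:
        raise ValueError("need at least 3 residuals")
    mu = sum(residuals) / n
    m2 = sum((r - mu) ** 2 for r in residuals) / n
    if m2 == 0:
        raise ValueError("residuals have zero variance")
    m3 = sum((r - mu) ** 3 for r in residuals) / n
    m4 = sum((r - mu) ** 4 for r in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    # Survival function of a chi-square with 2 degrees of freedom is exp(-x / 2).
    p_value = math.exp(-jb / 2)
    return jb, p_value
```

A small p-value rejects normality. Large residual spikes, such as those around holidays, inflate the kurtosis term and are one plausible reason for the rejection.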

The 15-day forecast is consistent with the original pattern of the data. The accuracy of these predictions can be assessed by evaluating the model on a held-out subset of the training data.
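A chronological hold-out for this kind of evaluation can be sketched as follows. The helper names are hypothetical, and the seasonal-naive forecaster is only a stand-in baseline to compare against, not the SARIMAX model used in this notebook.

```python
def holdout_split(series, horizon=15):
    """Split a time-ordered series into (train, test), keeping the last
    `horizon` observations for evaluation. No shuffling: order matters."""
    if not 0 < horizon < len(series):
        raise ValueError("horizon must be between 1 and len(series) - 1")
    return series[:-horizon], series[-horizon:]


def seasonal_naive_forecast(train, horizon, period=7):
    """Baseline forecast: repeat the last observed season (assumed weekly)."""
    if len(train) < period:
        raise ValueError("need at least one full period of training data")
    last_season = train[-period:]
    return [last_season[i % period] for i in range(horizon)]
```

Fitting on `train`, forecasting `len(test)` steps, and scoring against `test` gives an error estimate for the same 15-day horizon. A model that cannot beat the seasonal-naive baseline is probably not worth its complexity.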

Model Performance: RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error) vary widely across stores. The stores with the lowest errors are Store 35 (RMSE: 888.22, MAE: 648.90) and Store 23 (RMSE: 825.03, MAE: 683.34). The stores with the highest errors are Store 44 (RMSE: 8133.22, MAE: 6279.13) and Store 46 (RMSE: 6406.22, MAE: 4955.34). This variability is perhaps acceptable given that the data was not normalized: RMSE and MAE are expressed in sales units, so stores with larger sales volumes naturally show larger absolute errors.
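The two metrics, plus one scale-free alternative, can be sketched as below. WAPE (weighted absolute percentage error) is not used in this notebook; it is included as an assumption-labelled suggestion for comparing stores of different sizes.

```python
import math


def _check(actual, predicted):
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the data."""
    _check(actual, predicted)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error, in the same units as the data."""
    _check(actual, predicted)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def wape(actual, predicted):
    """Total absolute error divided by total absolute actuals (scale-free)."""
    _check(actual, predicted)
    total = sum(abs(a) for a in actual)
    if total == 0:
        raise ValueError("WAPE is undefined when all actual values are zero")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / total
```

Multiplying a store's sales and forecasts by ten multiplies its RMSE and MAE by ten but leaves WAPE unchanged, which is why a scaled metric gives a fairer ranking across stores.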

Lowest Errors: BABY CARE has the lowest RMSE and MAE of all product families. BOOKS and HARDWARE also show very low errors. Because these metrics are in sales units, low absolute errors may partly reflect low sales volumes in these categories rather than better model performance.

High Errors: BEVERAGES has the highest RMSE and MAE, which suggests significant prediction errors. This might indicate issues with the model or with data quality for this product family. GROCERY I also has very high RMSE and MAE, pointing to poor model performance.

Moderate Errors: Product families such as CLEANING, DAIRY, and BREAD/BAKERY have moderate errors. While these are not as high as for BEVERAGES or GROCERY I, there is room for improvement.